Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, matched, and distinguished, ER is known to be a challenging task for automated strategies, but comparatively easy for expert humans. In our work, we abstract the knowledge of experts with the notion of a binary oracle. Our oracle can answer questions of the form "do records u and v refer to the same entity?" under a flexible error model, allowing for some questions to be more difficult to answer correctly than others. Our contribution is a general error correction tool that can be leveraged by a variety of hybrid human-machine ER algorithms, based on a formal way of selecting indirect "control queries". In our experiments we demonstrate that correction-less ER algorithms equipped with our tool can perform even better than recent ER algorithms specifically designed for correcting errors. Our control queries are selected among those that provide the strongest connectivity between records of each cluster, based on the concept of graph expanders (which are sparse graphs with formal connectivity properties). We give formal performance guarantees for our toolkit and provide experiments on real and synthetic data.
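To make the idea of an error-prone binary oracle and of control queries more concrete, here is a minimal Python sketch. It is not the authors' algorithm. The names `make_noisy_oracle` and `control_vote`, the single uniform `error_rate`, and the use of uniform random sampling in place of the expander-based selection described in the abstract are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Sequence, Tuple


def make_noisy_oracle(
    truth: Dict[str, int], error_rate: float, rng: random.Random
) -> Callable[[str, str], bool]:
    """Build a hypothetical oracle answering "do u and v refer to the same entity?".

    Each distinct pair gets its answer flipped with probability `error_rate`.
    Answers are cached, so repeating the same question returns the same
    (possibly wrong) answer; this is an assumption of the sketch.
    """
    cache: Dict[Tuple[str, str], bool] = {}

    def oracle(u: str, v: str) -> bool:
        key = (u, v) if u <= v else (v, u)
        if key not in cache:
            correct = truth[u] == truth[v]
            flipped = rng.random() < error_rate
            cache[key] = (not correct) if flipped else correct
        return cache[key]

    return oracle


def control_vote(
    record: str,
    cluster: Sequence[str],
    oracle: Callable[[str, str], bool],
    num_controls: int,
    rng: random.Random,
) -> bool:
    """Decide whether `record` belongs to `cluster` by a strict majority vote.

    Instead of trusting one oracle answer, ask about `record` against several
    randomly chosen cluster members (the "control queries" of this sketch).
    Ties and empty clusters count as "does not belong".
    """
    controls = rng.sample(list(cluster), min(num_controls, len(cluster)))
    yes_votes = sum(1 for c in controls if oracle(record, c))
    return 2 * yes_votes > len(controls)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical ground truth: record id -> entity id.
    truth = {"a1": 0, "a2": 0, "a3": 0, "a4": 0, "b1": 1}
    oracle = make_noisy_oracle(truth, error_rate=0.2, rng=rng)
    print(control_vote("a4", ["a1", "a2", "a3"], oracle, num_controls=3, rng=rng))
    print(control_vote("b1", ["a1", "a2", "a3"], oracle, num_controls=3, rng=rng))
```

In this sketch a single wrong answer is outvoted as long as most control queries are answered correctly. The approach described in the abstract chooses control queries more carefully, using graph expanders, so that few queries give strong connectivity within each cluster.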

Robust entity resolution using random graphs / Galhotra, Sainyam; Firmani, Donatella; Saha, Barna; Srivastava, Divesh. - (2018), pp. 3-18. (Paper presented at the 44th ACM SIGMOD International Conference on Management of Data (SIGMOD), Winner of the REPRODUCIBILITY AWARD, Class A++ (GII-GRIN rating), held in Houston, TX, USA) [10.1145/3183713.3183755].

Robust entity resolution using random graphs

Firmani, Donatella
2018

44th ACM SIGMOD International Conference on Management of Data (SIGMOD), Winner of the REPRODUCIBILITY AWARD, Class A++ (GII-GRIN rating)
Software; Information Systems
04 Publication in conference proceedings::04b Conference paper in volume


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1640576

Citations
  • PubMed Central (PMC): not available
  • Scopus: 19
  • Web of Science (ISI): 13